Downloading SRA reads from archives

Sources of reads

Microbiome read sequencing data may be obtained from different sources. The most common ones include:

  1. Reads obtained directly from a sequencing platform by investigators.
  2. Reads downloaded from the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).
  3. Reads synthesized using sequencing simulators.


Snakemake workflow for downloading SRA reads

This project provides a tentative Snakemake workflow whose rules for downloading FASTQ sequences from SRA form a DAG (directed acyclic graph). A detailed interactive Snakemake report is available here.


Installing SRA Toolkit

  • Navigate to where you want to install the tools, preferably the home directory.
  • For more information click here.

Demo on macOS

curl -LO  https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-mac64.tar.gz
tar -xf sratoolkit.3.0.0-mac64.tar.gz
export PATH=$HOME/sratoolkit.3.0.0-mac64/bin/:$PATH
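
The export above only affects the current shell session. As a minimal sketch, the hypothetical helper below (add_to_path is not part of the SRA Toolkit) prepends a directory to PATH only if it is not already there, so it can be placed in a shell startup file such as ~/.zshrc or ~/.bash_profile without growing PATH on every login. Which startup file applies depends on your shell, which is an assumption here.

```shell
# Hypothetical helper: prepend a directory to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                        # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;  # prepend and export
    esac
}

# Example call, using the install location from the demo above:
# add_to_path "$HOME/sratoolkit.3.0.0-mac64/bin"
```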

Create a cache root directory

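# The SRA Toolkit reads its user settings from the hidden ~/.ncbi directory.
# Replace cache_directory with the absolute path of the directory to use as the cache.
mkdir -p ~/.ncbi
echo '/repository/user/main/public/root = "cache_directory"' > ~/.ncbi/user-settings.mkfg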

Confirm sra toolkit configuration

  • The vdb-config -i command below opens a blue dialog.
  • Press tab or c to move to the cache tab.
  • Review the configuration, then press s to save and x to exit.
vdb-config -i

A screenshot of the SRA configuration.


For more information click here.

Alternative method

Alternatively, create a conda environment and install the essential tools in it. For example, an environment named sradb can be defined in environment.yml:

name: sradb
channels:
  - conda-forge
  - bioconda
dependencies:
  - sra-tools
  - entrez-direct
  - pysradb

Create the environment from the file (the name and channels are read from environment.yml):

mamba env create -f environment.yml


Downloading multiple fastq files

  • Make sure that fasterq-dump is on the PATH.
  • Type which fasterq-dump or fasterq-dump --help to confirm.
  • Specify the output directory and the temporary directory explicitly.
  • A range of SRA accessions can be specified in a for loop.
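
The PATH check in the bullets above can be sketched as a small reusable function. The name check_tools is hypothetical. It relies only on the shell's command -v and reports every tool that cannot be found, so a long download does not fail halfway through.

```shell
# Hypothetical helper: print each command that is missing from PATH.
# Returns 0 when all commands are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with tools used in this tutorial:
# check_tools fasterq-dump vdb-config seqkit || echo "Fix the PATH before continuing."
```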

Example code for downloading reads for SRA accessions ranging from SRR7450706 to SRR7450761

# Download runs SRR7450706 to SRR7450761, one accession per iteration.
for (( i = 706; i <= 761; i++ ))
    do
        time fasterq-dump "SRR7450${i}" \
        --split-3 \
        --force \
        --skip-technical \
        --outdir data/reads \
        --temp data/temp \
        --threads 4
    done
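
When the accessions do not form a contiguous range, the same download can be driven from a text file. The sketch below assumes a file with one run accession per line, where blank lines and lines starting with # are ignored. The function names and the DRY_RUN switch are hypothetical, and the fasterq-dump options are copied from the loop above.

```shell
# Print the accessions in a list file, skipping blank lines and '#' comments.
read_accessions() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;
        esac
        printf '%s\n' "$line"
    done < "$1"
}

# Download every accession in the list file.
# With DRY_RUN=1 the commands are printed instead of executed.
download_accessions() {
    opts="--split-3 --force --skip-technical --outdir data/reads --temp data/temp --threads 4"
    read_accessions "$1" | while IFS= read -r acc; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "fasterq-dump $acc $opts"
        else
            # $opts is left unquoted on purpose so it splits into separate flags;
            # stdin is redirected so the tool cannot consume the accession list.
            fasterq-dump "$acc" $opts < /dev/null
        fi
    done
}
```

A call such as download_accessions results/run_accessions.txt would then process the whole list. That file name appears in the project tree, but its one-accession-per-line layout is an assumption.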


Compressing and uncompressing files

Microbiome FASTQ files are usually very large, so compressing them with gzip can save a lot of disk space.

Uncompressing with bash

gunzip data/reads/*.gz


Compressing with bash

gzip data/reads/*.fastq
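
Because gzip replaces each original file with its compressed copy, it can be worth verifying every archive right after it is written. A minimal sketch, assuming a hypothetical helper named compress_fastq_dir and using gzip -t for the integrity check:

```shell
# Compress every .fastq file in a directory and test each resulting archive.
# Stops with a non-zero status at the first failure.
compress_fastq_dir() {
    for f in "$1"/*.fastq; do
        [ -e "$f" ] || continue          # no matches: the glob stays literal
        gzip "$f" && gzip -t "$f.gz" || return 1
    done
}

# Example call for the download directory used in this tutorial:
# compress_fastq_dir data/reads
```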


Resizing Fastq files

  • Sometimes we want to extract a small subset to test the bioinformatics pipeline.
  • You can resize the fastq files using the seqkit sample function[1].

Example: extracting only 1% of the paired-end metagenomics sequencing data.

This bash script extracts 1% of the reads from four samples (SRR10245277 to SRR10245280).

mkdir -p data
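for i in {77..80}
  do
    # Process each mate file in its own pipeline. Using identical seeds for
    # both mates is intended to select and order the same reads in each file.
    seqkit sample -p 0.01 -s 11 "SRR102452${i}_1.fastq" \
    | seqkit shuffle -s 23 -o "data/SRR102452${i}_1_sub.fastq"
    seqkit sample -p 0.01 -s 11 "SRR102452${i}_2.fastq" \
    | seqkit shuffle -s 23 -o "data/SRR102452${i}_2_sub.fastq"
  done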
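
After subsampling paired-end data, a quick sanity check is to confirm that both mate files contain the same number of reads. The sketch below counts reads as lines divided by four, which assumes standard four-line FASTQ records with unwrapped sequences. The function names are hypothetical.

```shell
# Count reads in an uncompressed FASTQ file (four lines per record assumed).
count_reads() {
    lines=$(wc -l < "$1" | tr -d ' ')
    echo $(( lines / 4 ))
}

# Compare the read counts of two mate files; returns 1 on a mismatch.
check_pair() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 vs $n2 reads"
        return 1
    fi
}

# Example call for one of the subsampled samples:
# check_pair data/SRR10245277_1_sub.fastq data/SRR10245277_2_sub.fastq
```

Equal counts do not prove that the mates are still in matching order, so comparing the first few read identifiers of both files is a useful second check.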



References

[1] Shen, W. (2022). SeqKit - ultrafast FASTA/q kit. Retrieved from https://bioinf.shenwei.me/seqkit/
[2] Buza, T. M., Tonui, T., Stomeo, F., Tiambo, C., Katani, R., Schilling, M., … Kapur, V. (2019). iMAP: An integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics, 20. https://doi.org/10.1186/S12859-019-2965-4



Appendix

Project main tree

.
├── LICENSE
├── README.md
├── config
│   ├── config.yaml
│   └── samples.tsv
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── data
│   ├── metadata
│   ├── reads
│   ├── temp
│   └── test
├── docs
│   └── env_spec_file.txt
├── images
│   ├── smkreport
│   ├── sra.png
│   └── sra_config_cache.png
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── imap.bib
│   └── references.bib
├── report.html
├── results
│   ├── project_tree.txt
│   └── run_accessions.txt
├── styles.css
└── workflow
    ├── Snakefile
    ├── envs
    ├── reports
    ├── rules
    ├── schemas
    └── scripts

18 directories, 18 files

Screenshot of interactive snakemake report

The interactive Snakemake HTML report can be viewed by opening report.html in any modern web browser. There you can explore the workflow and its associated statistics. Collapsing the left sidebar gives a wider display.

Troubleshooting and FAQs

  1. Question
    • Answer
  2. Question
    • Answer